This report explores a dataset of 4,898 white wines. Each wine has 11 variables quantifying its chemical properties, plus a quality score between 0 (very bad) and 10 (excellent).

Reference:

  1. ref1_wine
  2. ref2_wiki
  3. ref3_baidu
  4. ref4_RandomForest

Univariate Plots Section

## 'data.frame':    4898 obs. of  12 variables:
##  $ fixed.acidity       : num  7 6.3 8.1 7.2 7.2 8.1 6.2 7 6.3 8.1 ...
##  $ volatile.acidity    : num  0.27 0.3 0.28 0.23 0.23 0.28 0.32 0.27 0.3 0.22 ...
##  $ citric.acid         : num  0.36 0.34 0.4 0.32 0.32 0.4 0.16 0.36 0.34 0.43 ...
##  $ residual.sugar      : num  20.7 1.6 6.9 8.5 8.5 6.9 7 20.7 1.6 1.5 ...
##  $ chlorides           : num  0.045 0.049 0.05 0.058 0.058 0.05 0.045 0.045 0.049 0.044 ...
##  $ free.sulfur.dioxide : num  45 14 30 47 47 30 30 45 14 28 ...
##  $ total.sulfur.dioxide: num  170 132 97 186 186 97 136 170 132 129 ...
##  $ density             : num  1.001 0.994 0.995 0.996 0.996 ...
##  $ pH                  : num  3 3.3 3.26 3.19 3.19 3.26 3.18 3 3.3 3.22 ...
##  $ sulphates           : num  0.45 0.49 0.44 0.4 0.4 0.44 0.47 0.45 0.49 0.45 ...
##  $ alcohol             : num  8.8 9.5 10.1 9.9 9.9 10.1 9.6 8.8 9.5 11 ...
##  $ quality             : Factor w/ 7 levels "3","4","5","6",..: 4 4 4 4 4 4 4 4 4 4 ...
##  fixed.acidity    volatile.acidity  citric.acid     residual.sugar  
##  Min.   : 3.800   Min.   :0.0800   Min.   :0.0000   Min.   : 0.600  
##  1st Qu.: 6.300   1st Qu.:0.2100   1st Qu.:0.2700   1st Qu.: 1.700  
##  Median : 6.800   Median :0.2600   Median :0.3200   Median : 5.200  
##  Mean   : 6.855   Mean   :0.2782   Mean   :0.3342   Mean   : 6.391  
##  3rd Qu.: 7.300   3rd Qu.:0.3200   3rd Qu.:0.3900   3rd Qu.: 9.900  
##  Max.   :14.200   Max.   :1.1000   Max.   :1.6600   Max.   :65.800  
##                                                                     
##    chlorides       free.sulfur.dioxide total.sulfur.dioxide
##  Min.   :0.00900   Min.   :  2.00      Min.   :  9.0       
##  1st Qu.:0.03600   1st Qu.: 23.00      1st Qu.:108.0       
##  Median :0.04300   Median : 34.00      Median :134.0       
##  Mean   :0.04577   Mean   : 35.31      Mean   :138.4       
##  3rd Qu.:0.05000   3rd Qu.: 46.00      3rd Qu.:167.0       
##  Max.   :0.34600   Max.   :289.00      Max.   :440.0       
##                                                            
##     density             pH          sulphates         alcohol     
##  Min.   :0.9871   Min.   :2.720   Min.   :0.2200   Min.   : 8.00  
##  1st Qu.:0.9917   1st Qu.:3.090   1st Qu.:0.4100   1st Qu.: 9.50  
##  Median :0.9937   Median :3.180   Median :0.4700   Median :10.40  
##  Mean   :0.9940   Mean   :3.188   Mean   :0.4898   Mean   :10.51  
##  3rd Qu.:0.9961   3rd Qu.:3.280   3rd Qu.:0.5500   3rd Qu.:11.40  
##  Max.   :1.0390   Max.   :3.820   Max.   :1.0800   Max.   :14.20  
##                                                                   
##  quality 
##  3:  20  
##  4: 163  
##  5:1457  
##  6:2198  
##  7: 880  
##  8: 175  
##  9:   5

## 
##    3    4    5    6    7    8    9 
##   20  163 1457 2198  880  175    5

Most wines score between 5 and 7; wines scoring 3 or 9 are rare. No wine scores below 3 or above 9.
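To put a number on "most", the share of wines scoring 5 to 7 can be worked out from the counts printed above. The sketch below re-enters those counts by hand; the name `quality_counts` is illustrative and not part of the original analysis.

```r
# Quality counts copied from the table() output above.
quality_counts <- c("3" = 20, "4" = 163, "5" = 1457, "6" = 2198,
                    "7" = 880, "8" = 175, "9" = 5)

# Share of wines whose score is 5, 6 or 7.
share_5_to_7 <- sum(quality_counts[c("5", "6", "7")]) / sum(quality_counts)

# Report as a percentage with one decimal place.
print(round(100 * share_5_to_7, 1))
```

With these counts the share comes to roughly 92.6% of the 4898 wines.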

Fixed acidity (mainly tartaric acid) is approximately normally distributed, with most values concentrated between 6 and 7.5. According to the reference, tartaric acid helps maintain chemical stability and wine color and affects the taste of the finished product. Because tartaric acid is strongly acidic, a high concentration makes wine taste rough.

## [1] 66

The distribution is approximately normal with a slight right skew. 66 wines contain a high level of acetic acid, more than 0.6 g/L, which can lead to an unpleasant, vinegar taste. This negative effect could help to distinguish poor-quality wines.
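A count like the 66 above comes from comparing each wine's volatile acidity against the 0.6 g/L threshold. A minimal sketch of that kind of count, using a small made-up vector rather than the real data (the values and the name `volatile_acidity` are assumptions for illustration):

```r
# Hypothetical volatile acidity values in g/L; not taken from the wine data set.
volatile_acidity <- c(0.27, 0.30, 0.65, 0.23, 0.72, 0.28, 0.61)

# Threshold above which acetic acid is said to give a vinegar taste.
threshold <- 0.6

# The comparison yields TRUE/FALSE per wine; sum() counts the TRUE values.
n_high <- sum(volatile_acidity > threshold)
print(n_high)
```

The same pattern applies to any single-variable cutoff, for example free SO2 above 50 mg/L.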

According to the International Organisation of Vine and Wine (OIV), citric acid content must not exceed 1 g/L. But there is an unexpected peak at 0.49 rather than near 1, which I can’t explain yet. I wonder whether it has something to do with wine quality.

After a log10 transformation of the x-axis, a bimodal distribution appears, with two peaks around 1.6 and 10 and a trough around 3.3. I suspect this reflects different kinds of wine that vary in residual sugar, such as dry and sweet wines.
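To illustrate why a log10 scale can expose two modes, here is a sketch on simulated data. The mixture below is entirely hypothetical (two clusters centred near 1.6 and 10, with made-up spreads); it is not the wine data and only mimics the shape described above.

```r
set.seed(1)

# Hypothetical mixture: one cluster of wines near 1.6 g/L, another near 10 g/L.
sugar <- c(
  10^rnorm(500, mean = log10(1.6), sd = 0.12),
  10^rnorm(500, mean = log10(10),  sd = 0.15)
)

# Work on the log10 scale, as the transformed x-axis does.
log_sugar <- log10(sugar)

# Count simulated wines within 0.1 log10 units of a given sugar level.
count_near <- function(level) sum(abs(log_sugar - log10(level)) < 0.1)

near_low_peak  <- count_near(1.6)
near_trough    <- count_near(3.3)
near_high_peak <- count_near(10)
print(c(near_low_peak, near_trough, near_high_peak))

# hist(log_sugar, breaks = 30) would draw the two humps.
```

Far fewer simulated wines fall near 3.3 than near either peak, which is the signature of a bimodal distribution on the log scale.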

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
## 0.00900 0.03600 0.04300 0.04577 0.05000 0.34600

Most wines have a chloride content below 0.1, and the third quartile is 0.05.

## [1] 868

SO2 is mostly undetectable in wine, but at free SO2 concentrations over 50 ppm it becomes evident in the nose and taste, which could lower the quality of the wine. Using mg/L as a rough approximation of ppm, 868 wines are over this limit.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##     9.0   108.0   134.0   138.4   167.0   440.0

The distribution of total.sulfur.dioxide has more variance but fewer outliers than free.sulfur.dioxide.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.9871  0.9917  0.9937  0.9940  0.9961  1.0390

Most wines have a density between 0.99 and 1.00. I suspect density is related to alcohol and residual sugar content.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   2.720   3.090   3.180   3.188   3.280   3.820

This is the closest to a normal distribution of any variable in this section. pH should be influenced by fixed.acidity.

Sulphates are a wine additive that can contribute to sulfur dioxide gas (SO2) levels.

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    8.00    9.50   10.40   10.51   11.40   14.20

According to the reference, alcohol has a double effect on wine taste. On one hand, the mellowness of a wine only becomes evident when the alcohol content is higher than 11% (v/v), and alcohol content below 10% (v/v) makes the wine taste flat rather than fat. On the other hand, alcohol content above 14% becomes noticeable and brings unpleasant sensations such as strong heat and bitterness.

Univariate Analysis

What’s the structure of your dataset?

There are 4898 wine samples in the dataset with 12 features (fixed.acidity, volatile.acidity, citric.acid, residual.sugar, chlorides, free.sulfur.dioxide, total.sulfur.dioxide, density, pH, sulphates, alcohol, quality). The variable quality is an ordered factor variable on the following scale; only scores 3 through 9 occur in the data.
(worst) -----------> (best)
quality: 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10

Other observations:
* Most wines score between 5 and 7.
* There is a notable peak for citric acid at 0.49.
* 66 wines contain a high level of acetic acid, more than 0.6 g/L.
* 868 wines have free SO2 concentrations over 50 ppm.
* The median alcohol content is 10.4 and the maximum is 14.20.
* Most wines have chlorides below 0.1 g/dm^3.
* Most wines have residual sugar below 20 g/dm^3.

What are the main features of interest in your dataset?

The main features are quality and alcohol (a guess based on the references). I’d like to train a model to classify the quality of a wine, and alcohol should play an important role.

What other features in the dataset do you think will help support your investigation into your feature(s) of interest?

Volatile.acidity, free.sulfur.dioxide, residual.sugar and citric.acid likely contribute to the quality of a wine, but for now I can’t tell which contributes most.

Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?

There is a citric.acid peak at 0.49 g/dm^3 in the distribution, which I can’t explain.
I log-transformed the right-skewed residual.sugar distribution. The transformed distribution appears bimodal, with two peaks around 1.6 and 10 and a trough around 3.3.

Bivariate Plots Section

Looking at the left subplots, we can see that the distributions differ across quality groups, especially for alcohol, volatile.acidity, free.sulfur.dioxide, total.sulfur.dioxide and residual.sugar.

The median alcohol content decreases from the score-3 group to the score-5 group, then rises quickly through the last group, which has the highest median, 12.5. High-quality wine seems to have higher alcohol content.

## 
##    3    4    5    6    7    8    9 
##   19  142 1433 2182  878  173    5
## 
##  3  4  5  6  7  8  9 
##  1 21 24 16  2  2  0

In the first plot, volatile.acidity rises from the score-3 group to the score-4 group, then decreases slowly. The score-4 group has the highest volatile acidity and the largest variance.
I divide the dataset into two parts depending on whether a wine’s volatile acidity exceeds 0.6 g/dm^3. Comparing the two parts, wines with high volatile acidity rarely score 7 or higher; most score between 4 and 6.
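The split described above can be sketched as a logical flag followed by a cross-tabulation. The data frame `toy` and its values are invented for illustration, and quality is kept numeric here for simplicity even though the report stores it as a factor.

```r
# Hypothetical toy data; not rows from the wine data set.
toy <- data.frame(
  volatile.acidity = c(0.27, 0.65, 0.30, 0.72, 0.23, 0.61, 0.28, 0.33),
  quality          = c(6,    4,    7,    5,    8,    6,    5,    7)
)

# Flag wines above the 0.6 g/dm^3 threshold.
toy$high.va <- toy$volatile.acidity > 0.6

# Quality counts within each part (rows: FALSE = at or below 0.6, TRUE = above).
counts <- table(high.va = toy$high.va, quality = toy$quality)
print(counts)

# Share of wines scoring 7 or higher in each part.
share_good <- tapply(toy$quality >= 7, toy$high.va, mean)
print(share_good)
```

Comparing shares rather than raw counts makes the two parts comparable even when one part is much smaller than the other, as it is in the real data (66 wines against 4832).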

## 
##    3    4    5    6    7    8    9 
##   15  149 1108 1808  795  151    4
## 
##   3   4   5   6   7   8   9 
##   5  14 349 390  85  24   1

The medians of the groups are quite close, except for wines scoring 4 or 9. The score-4 group has the lowest median, followed by the score-9 group.
I divide the dataset into two parts depending on whether a wine’s free SO2 content exceeds 50 mg/L. Most wines with free.sulfur.dioxide over 50 mg/L are of medium quality, scoring 5 or 6.

Since total sulphur dioxide consists of free and bound forms, I create a new feature named bound.sulfur.dioxide by subtracting free.sulfur.dioxide from total.sulfur.dioxide, and draw a boxplot of it.
The pattern of the free SO2 distribution is very similar to that of total SO2. Across the first three quality groups, the medians decrease and then increase. In the following groups, the medians of free SO2 and total SO2 both decline, but the latter declines more, which can be explained by the bound.sulfur.dioxide distribution.
So I’d like to use free.sulfur.dioxide and bound.sulfur.dioxide in place of total.sulfur.dioxide when I build my classification model later.
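The derived feature is a simple column difference. A minimal sketch using the first four rows visible in the `str()` output at the top of this report; the data frame name `wine_head` is an assumption for illustration.

```r
# First four rows of the two SO2 columns, as printed by str() above.
wine_head <- data.frame(
  free.sulfur.dioxide  = c(45, 14, 30, 47),
  total.sulfur.dioxide = c(170, 132, 97, 186)
)

# Bound SO2 is whatever part of the total is not free.
wine_head$bound.sulfur.dioxide <-
  wine_head$total.sulfur.dioxide - wine_head$free.sulfur.dioxide

print(wine_head)
```

Because free and bound SO2 add up to the total by construction, keeping all three in one model would make them perfectly collinear, which is one reason to drop the total once the bound part is available.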

In the earlier analysis, the residual.sugar distribution turned out to be bimodal after the log transformation. To see whether this has anything to do with quality, I plot histograms of residual.sugar faceted by quality. All groups appear bimodal except the top and bottom groups, which have too few samples. So the bimodality is more likely a general phenomenon that has little to do with wine quality; variation in sugar content between wine styles may explain it.

The residual.sugar medians follow a pattern similar to free SO2 (decline, rise, decline), and the score-4 group has the lowest median.

The density distribution across quality is quite similar to that of residual.sugar. Density is also strongly related to alcohol, total.sulfur.dioxide and fixed.acidity. I’m going to build a linear model to predict and replace density in the multivariate analysis section.

I plot the citric.acid distribution faceted by quality. All groups have a peak at 0.49 except the top and bottom groups, so this unusual peak cannot be explained by quality differences.
The medians of the groups are very close; the score-4 and score-9 groups have the lowest and highest medians respectively.

Though the medians of the groups are quite close, those of the first and last groups are slightly higher.

As quality rises across groups, the median pH increases, except in the first group. Wines scoring 9 have the highest median.

Though the differences between the medians of the quality groups are small, wines scoring 9 tend to have lower chlorides. Wines of middle quality have much more variance in chlorides.

The medians of the groups are quite close, but wines of fair quality have more variance in sulphates.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$alcohol and wine$density
## t = -87.255, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  -0.7908646 -0.7689315
## sample estimates:
##        cor 
## -0.7801376

The density of wine is negatively correlated with alcohol content; the correlation coefficient is -0.78.

## 
##  Pearson's product-moment correlation
## 
## data:  wine$residual.sugar and wine$density
## t = 107.87, df = 4896, p-value < 2.2e-16
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
##  0.8304732 0.8470698
## sample estimates:
##       cor 
## 0.8389665

The density of wine is positively correlated with residual sugar content; the correlation coefficient is 0.84.
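The two outputs above are Pearson correlation tests as reported by R's `cor.test`. A minimal sketch using only the first five rows visible in the `str()` output at the top of this report; five rounded rows are far too few for a meaningful estimate, so the numbers will not match the full-data results, and the name `wine_head` is an assumption.

```r
# First five rows as printed by str() above (rounded values).
wine_head <- data.frame(
  alcohol        = c(8.8, 9.5, 10.1, 9.9, 9.9),
  residual.sugar = c(20.7, 1.6, 6.9, 8.5, 8.5),
  density        = c(1.001, 0.994, 0.995, 0.996, 0.996)
)

# Pearson correlation tests, mirroring the two calls whose output is shown above.
alcohol_test <- cor.test(wine_head$alcohol, wine_head$density)
sugar_test   <- cor.test(wine_head$residual.sugar, wine_head$density)

# The sign of each estimate gives the direction of the relationship.
print(alcohol_test$estimate)
print(sugar_test$estimate)
```

Even on this tiny sample the signs agree with the full-data results: negative for alcohol, positive for residual sugar.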

This matrix plot shows other features (total.sulfur.dioxide, free.sulfur.dioxide, bound.sulfur.dioxide, fixed.acidity) that are related to density. Total SO2 and bound SO2 both have a moderately positive relationship with density. Free SO2 and fixed acidity both have a slightly positive relationship with density.

Bivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?

Quality is most clearly related to alcohol, volatile.acidity and free.sulfur.dioxide.

Among fair- and high-quality wines, the better the wine, the more alcohol it has. Among low-quality wines, however, wines scoring 4 have a lower median than wines scoring 3.

Only four wines with more than 0.6 g/dm^3 volatile acidity score 7 or higher. Wine with high volatile acidity therefore rarely gets a good score.

Most wines with free.sulfur.dioxide over 50 mg/L are of medium quality, scoring 5 or 6.

Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?

The density of wine is strongly correlated with alcohol and residual.sugar: the more alcohol a wine has, the lower its density, and the more residual sugar it has, the higher its density. In addition, total SO2, bound SO2, free SO2 and fixed acidity have a weaker positive relationship with density than residual sugar does.

Multivariate Plots Section

The plots above illustrate the earlier observation: more residual sugar and less alcohol mean higher density, and vice versa.

Next, I build a linear model to predict the density of wine.

## 
## Calls:
## m1: lm(formula = density ~ alcohol + residual.sugar + total.sulfur.dioxide, 
##     data = wine)
## m2: lm(formula = density ~ alcohol + residual.sugar + free.sulfur.dioxide + 
##     bound.sulfur.dioxide, data = wine)
## m3: lm(formula = density ~ alcohol + residual.sugar + bound.sulfur.dioxide, 
##     data = wine)
## m4: lm(formula = density ~ alcohol + residual.sugar + bound.sulfur.dioxide + 
##     fixed.acidity + free.sulfur.dioxide, data = wine)
## m5: lm(formula = density ~ alcohol + residual.sugar + bound.sulfur.dioxide + 
##     fixed.acidity, data = wine)
## 
## ====================================================================================
##                             m1          m2          m3          m4          m5      
## ------------------------------------------------------------------------------------
##   (Intercept)            1.003***    1.003***    1.003***    0.999***    0.999***   
##                         (0.000)     (0.000)     (0.000)     (0.000)     (0.000)     
##   alcohol               -0.001***   -0.001***   -0.001***   -0.001***   -0.001***   
##                         (0.000)     (0.000)     (0.000)     (0.000)     (0.000)     
##   residual.sugar         0.000***    0.000***    0.000***    0.000***    0.000***   
##                         (0.000)     (0.000)     (0.000)     (0.000)     (0.000)     
##   total.sulfur.dioxide   0.000***                                                   
##                         (0.000)                                                     
##   free.sulfur.dioxide               -0.000***               -0.000***               
##                                     (0.000)                 (0.000)                 
##   bound.sulfur.dioxide               0.000***    0.000***    0.000***    0.000***   
##                                     (0.000)     (0.000)     (0.000)     (0.000)     
##   fixed.acidity                                              0.001***    0.001***   
##                                                             (0.000)     (0.000)     
## ------------------------------------------------------------------------------------
##   R-squared                  0.911       0.915       0.914       0.935       0.935  
##   adj. R-squared             0.911       0.915       0.914       0.935       0.935  
##   sigma                      0.001       0.001       0.001       0.001       0.001  
##   F                      16738.603   13217.503   17440.713   14157.987   17649.239  
##   p                          0.000       0.000       0.000       0.000       0.000  
##   Log-likelihood         27448.397   27564.052   27540.255   28226.250   28219.529  
##   Deviance                   0.004       0.004       0.004       0.003       0.003  
##   AIC                   -54886.794  -55116.104  -55070.510  -56438.500  -56427.058  
##   BIC                   -54854.311  -55077.124  -55038.027  -56393.023  -56388.079  
##   N                       4898        4898        4898        4898        4898      
## ====================================================================================

The fifth linear model accounts for 93.5% of the variance in the density of wine, so I’d like to use the combination of alcohol, residual.sugar, bound.sulfur.dioxide and fixed.acidity in place of density when building the model to predict wine quality.
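Comparing nested linear models by R-squared and AIC, as the table above does, can be sketched with `lm`. The data below are simulated with made-up coefficients and noise, so the fitted numbers are illustrative only and are not the wine results; `toy`, `m_small` and `m_full` are hypothetical names.

```r
set.seed(42)
n <- 200

# Hypothetical predictors drawn over ranges loosely resembling the summary above.
toy <- data.frame(
  alcohol        = runif(n, 8, 14),
  residual.sugar = runif(n, 1, 20),
  fixed.acidity  = runif(n, 4, 10)
)

# Made-up generating equation for density (an assumption, not a fitted result).
toy$density <- 1.0 - 0.001 * toy$alcohol + 0.0004 * toy$residual.sugar +
  0.0008 * toy$fixed.acidity + rnorm(n, sd = 0.0005)

# Two nested models: the second adds fixed.acidity.
m_small <- lm(density ~ alcohol + residual.sugar, data = toy)
m_full  <- lm(density ~ alcohol + residual.sugar + fixed.acidity, data = toy)

# Collect the fit statistics side by side.
comparison <- data.frame(
  model     = c("m_small", "m_full"),
  r.squared = c(summary(m_small)$r.squared, summary(m_full)$r.squared),
  AIC       = c(AIC(m_small), AIC(m_full))
)
print(comparison)
```

R-squared can only go up when a predictor is added to a nested model, so AIC or BIC, which penalize extra terms, is the fairer basis for choosing between models such as m4 and m5.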

## Source: local data frame [4 x 3]
## Groups: high.volatile.acidity [?]
## 
##   high.volatile.acidity high.free.sulfur.dioxide     n
##                   <chr>                    <chr> <int>
## 1 volatile.acidity<=0.6  free.sulfur.dioxide<=50  3970
## 2 volatile.acidity<=0.6   free.sulfur.dioxide>50   862
## 3  volatile.acidity>0.6  free.sulfur.dioxide<=50    60
## 4  volatile.acidity>0.6   free.sulfur.dioxide>50     6

Wine is divided into 4 groups:

| group | data size | features |
| --- | --- | --- |
| low free SO2, low volatile acidity | 3970 | The quality distribution is similar to the whole. |
| high free SO2, low volatile acidity | 862 | Most wines score 5 or 6; only one wine scores 9. |
| low free SO2, high volatile acidity | 60 | Most wines score between 5 and 7; none score 9. |
| high free SO2, high volatile acidity | 6 | Wines score between 4 and 6; none score 3 or reach 7. |

In short, wines with high free SO2 or high volatile acidity are much less likely to be of high quality.
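The four-way grouping can be sketched in base R by labelling each wine on both thresholds and counting the combinations. The toy rows are invented; the label strings follow the group names printed in the output above, and the original analysis appears to have used a grouped count instead of `table`.

```r
# Hypothetical toy data; not rows from the wine data set.
toy <- data.frame(
  volatile.acidity    = c(0.27, 0.65, 0.30, 0.72, 0.23, 0.61),
  free.sulfur.dioxide = c(45,   14,   60,   55,   47,   30)
)

# Label each wine on both thresholds.
toy$high.volatile.acidity <- ifelse(toy$volatile.acidity > 0.6,
                                    "volatile.acidity>0.6",
                                    "volatile.acidity<=0.6")
toy$high.free.sulfur.dioxide <- ifelse(toy$free.sulfur.dioxide > 50,
                                       "free.sulfur.dioxide>50",
                                       "free.sulfur.dioxide<=50")

# Count wines in each of the four combinations.
group_sizes <- as.data.frame(
  table(toy$high.volatile.acidity, toy$high.free.sulfur.dioxide),
  responseName = "n"
)
names(group_sizes)[1:2] <- c("high.volatile.acidity", "high.free.sulfur.dioxide")
print(group_sizes)
```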

Next, I build the classification model.

## 
## Call:
##  randomForest(formula = quality ~ ., data = new_wine, mtry = 10,      ntree = 170) 
##                Type of random forest: classification
##                      Number of trees: 170
## No. of variables tried at each split: 10
## 
##         OOB estimate of  error rate: 28.26%
## Confusion matrix:
##    3  4    5    6   7  8  9 class.error
## 3 40  0    0    0   0  0  0   0.0000000
## 4  0 49   74   40   0  0  0   0.6993865
## 5  0 13 1048  382  14  0  0   0.2807138
## 6  0  2  278 1771 143  4  0   0.1942675
## 7  0  2   17  321 528 12  0   0.4000000
## 8  0  0    2   46  44 83  0   0.5257143
## 9  0  0    0    0   0  0 20   0.0000000

##                      MeanDecreaseGini
## fixed.acidity                280.5509
## volatile.acidity             366.2120
## citric.acid                  281.9894
## residual.sugar               324.5029
## chlorides                    296.9875
## free.sulfur.dioxide          363.8392
## pH                           315.3327
## sulphates                    289.7706
## alcohol                      484.9548
## bound.sulfur.dioxide         347.6010
##       predicted
## actual   3   4   5   6   7   8   9
##      3  11   0   0   0   0   0   0
##      4   0  49   0   0   0   0   0
##      5   0   0 428   0   0   0   0
##      6   0   0   0 621   0   0   0
##      7   0   0   0   0 271   0   0
##      8   0   0   0   0   0  55   0
##      9   0   0   0   0   0   0   3

I build a RandomForest model to classify the quality of wine. To address the imbalance of the data set, I add the score-3 wine data one more time and the score-9 wine data three more times. Although all predictions on my test data match the actual values, the model’s out-of-bag (OOB) error rate is 28.26%. Since all wine samples were used for training, the perfect test result suggests that the test rows overlapped with the training data, so the OOB estimate is the more reliable measure. Wines scoring 4, 8 and 7 are the three hardest groups to predict. The most important feature is alcohol, followed by volatile.acidity and free.sulfur.dioxide.
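One way to get a test score that is independent of training is to hold out the test rows before duplicating the rare classes. The sketch below shows that ordering on invented data; it is not the procedure used in this report, and `toy`, `train_balanced` and the 30-row holdout size are assumptions. Fitting the forest on `train_balanced` would be the next step and is omitted here.

```r
set.seed(7)

# Hypothetical imbalanced data: few wines scoring 3 or 9, many scoring 6.
toy <- data.frame(
  alcohol = round(runif(100, 8, 14), 1),
  quality = factor(c(rep("3", 4), rep("6", 90), rep("9", 6)))
)
toy$id <- seq_len(nrow(toy))

# 1. Hold out 30 rows BEFORE any duplication.
test_idx <- sample(nrow(toy), size = 30)
test  <- toy[test_idx, ]
train <- toy[-test_idx, ]

# 2. Oversample rare classes in the training part only.
train_balanced <- rbind(
  train,
  train[train$quality == "3", ],                # score 3 added one more time
  train[rep(which(train$quality == "9"), 3), ]  # score 9 added three more times
)

# No held-out wine appears in the balanced training set.
overlap <- intersect(test$id, train_balanced$id)
print(length(overlap))
print(table(train_balanced$quality))
```

Duplicating rows first and splitting afterwards would let copies of the same wine land on both sides of the split, which inflates test accuracy.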

Multivariate Analysis

Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?

The third picture shows the impact of high free SO2 and volatile acidity on wine quality. Wines with high free SO2 or high volatile acidity are much less likely to be of high quality. Wines with both high free SO2 and high volatile acidity score only between 4 and 6, never reaching 7.

Were there any interesting or surprising interactions between features?

The first plot shows the relationships among residual sugar, alcohol and density: more residual sugar and less alcohol mean higher density, and vice versa. The next matrix plot shows relationships between density and other features. Total.sulfur.dioxide and bound.sulfur.dioxide both have a moderately positive relation with density. Free.sulfur.dioxide and fixed.acidity both have a slightly positive relation with density.

OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.

Yes, I built two models. The first is a linear model that predicts the density of wine and accounts for 93.5% of its variance.

The second is a RandomForest model to classify the quality of wine. All my test data are predicted correctly, but the model’s out-of-bag error rate is 28.26%. Wines scoring 4, 7 or 8 are difficult to predict. Another limitation is that it cannot predict wines scoring 10 or below 3, because the data set contains no such samples.
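The 28.26% figure and the per-class errors can be recomputed directly from the confusion matrix printed in the model output. The sketch below re-enters that matrix by hand (rows are actual scores, columns are predicted scores); the variable names are illustrative.

```r
classes <- as.character(3:9)

# OOB confusion matrix copied from the randomForest output above.
confusion <- matrix(
  c(40,  0,    0,    0,   0,  0,  0,
     0, 49,   74,   40,   0,  0,  0,
     0, 13, 1048,  382,  14,  0,  0,
     0,  2,  278, 1771, 143,  4,  0,
     0,  2,   17,  321, 528, 12,  0,
     0,  0,    2,   46,  44, 83,  0,
     0,  0,    0,    0,   0,  0, 20),
  nrow = 7, byrow = TRUE,
  dimnames = list(actual = classes, predicted = classes)
)

# Per-class error: share of each actual class that was misclassified.
class_error <- 1 - diag(confusion) / rowSums(confusion)

# Overall error: share of all rows off the diagonal.
overall_error <- 1 - sum(diag(confusion)) / sum(confusion)

print(round(class_error, 4))
print(round(100 * overall_error, 2))
```

The overall error works out to 1394 misclassified rows out of 4933, which is the reported 28.26%, and the three largest per-class errors belong to scores 4, 8 and 7.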

Final Plots and Summary

Plot One

Description One

The residual sugar distribution appears bimodal on a log scale, both overall and when faceted by quality. This is perhaps because residual sugar content falls into two different ranges depending on the style of wine, such as dry and sweet wines. There are two peaks around 1.6 and 10, and a trough around 3.

Plot Two

Description Two

According to the International Organisation of Vine and Wine (OIV), citric acid content must not exceed 1 g/L. But there is an unexpected peak at 0.49 rather than near 1. After faceting by quality, all groups have a peak at 0.49 except the first and last groups, which means this unusual peak cannot be explained by quality differences. I still can’t explain it.
In the overall distribution, most wines have a citric acid content between 0.2 and 0.5 g/dm^3, and the median of 0.32 is close to the mean of 0.334. 307 wines have a citric acid content of 0.3 g/dm^3, forming the highest peak.

Plot Three

Description Three

The plot shows the impact of high free SO2 and volatile acidity on wine quality. The more free SO2 and volatile acidity a wine contains, the less likely it is to be of high quality. In the sub-plot in the bottom right corner, wines with both high free SO2 and high volatile acidity score only between 4 and 6, never reaching 7.

Reflection

The wine data set contains 4898 wine samples across 12 variables. I started by googling the meanings of the variables and their influence on wine quality. Then I examined the single-variable distributions and explored quality across many variables. I separated total.sulfur.dioxide into free and bound parts. After studying the relation between density and other features, I built a linear model to replace the density variable. Finally, I built a RandomForest model to classify the quality of wine.

At first, I thought fixed.acidity might be one of the features most relevant to quality. But the medians of fixed acidity in the quality groups were quite close, making fixed acidity less important. I explored the quality of wines across variables and found that the medians of alcohol content differed considerably between groups: they declined, then increased to the highest point, 12.5. Wines scoring 9 tended to have higher alcohol content and wines scoring 5 lower. When I separated the data set into four parts (low free SO2 & low volatile acidity, high free SO2 & low volatile acidity, low free SO2 & high volatile acidity, high free SO2 & high volatile acidity) and plotted the quality distribution, it became obvious that wines with high free SO2 or high volatile acidity are much less likely to be of high quality. For the final RandomForest model, I used 10 features (alcohol, volatile.acidity, free.sulfur.dioxide, bound.sulfur.dioxide, residual.sugar, pH, chlorides, sulphates, citric.acid, fixed.acidity) and included all wine samples.

The imbalance of the data set is serious. It contains 4898 wines, but only 20 wines scoring 3, 5 wines scoring 9, and no wines scoring below 3 or scoring 10. I added the score-3 wine data one more time and the score-9 wine data three more times before training the model. But I still couldn’t deal with the absent scores, and the model cannot recognize those quality categories. Besides, this classification model has a 28.26% error rate, mainly due to poor recognition of the medium-quality wines (scores 4 to 7). In further analysis, data on wines with the absent scores should be added, and I should explore more features to distinguish the medium-quality wines in detail.